bm/perf-eliminate-redundant-parsing#91

Merged
datastx merged 1 commit into main from bm/perf-eliminate-redundant-parsing
Feb 23, 2026
@datastx commented Feb 23, 2026

Summary

  • Eliminate YAML double-parse (31.8% of heap): process_node_dir() already reads each YAML file to probe the kind field, then discards the content, so the loaders had to read and parse the same file a second time. Added load_from_str() variants to ModelSchema, SourceFile, and FunctionDef so loaders reuse the already-read content instead of re-reading from disk and re-parsing.
  • Incremental FeatherFlowProvider (11.6% of heap): propagate_schemas() was rebuilding a new FeatherFlowProvider (converting all Arrow schemas) for every model in topo order — O(N²). Now builds once before the loop and calls insert_schema() incrementally — O(N).
  • Eliminate SQL double-parse in qualify (9.1% of heap): qualify_table_references() re-parsed SQL that was already parsed in compile_model_phase1(). Added qualify_statements() that operates directly on the existing AST. CompileOutput now carries parsed statements through to the qualification phase.
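
The first item can be illustrated with a minimal sketch. It is not the code from this PR: the struct fields, the `field()` helper (a naive `key: value` scan standing in for a real YAML parser), and `process_node()` are hypothetical, and only the `load_from_str()` name comes from the summary above. It shows the general shape: read the file once, probe `kind`, then hand the same buffer to the loader.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for a loader type such as ModelSchema.
#[derive(Debug, PartialEq)]
pub struct ModelSchema {
    pub kind: String,
    pub name: String,
}

/// Naive `key: value` lookup. Assumption: this replaces a real YAML
/// parser purely to keep the sketch dependency-free.
fn field<'a>(content: &'a str, key: &str) -> Option<&'a str> {
    content.lines().find_map(|line| {
        let (k, v) = line.split_once(':')?;
        (k.trim() == key).then(|| v.trim())
    })
}

impl ModelSchema {
    /// Parse from content the caller already holds; performs no disk I/O.
    pub fn load_from_str(content: &str) -> Option<Self> {
        Some(Self {
            kind: field(content, "kind")?.to_string(),
            name: field(content, "name")?.to_string(),
        })
    }

    /// Path-based loader kept as a thin wrapper over `load_from_str`.
    pub fn load(path: &Path) -> io::Result<Option<Self>> {
        let content = fs::read_to_string(path)?;
        Ok(Self::load_from_str(&content))
    }
}

/// Probe the `kind` field, then pass the same buffer to the loader
/// instead of letting the loader re-read the file.
pub fn process_node(content: &str) -> Option<ModelSchema> {
    match field(content, "kind")? {
        "model" => ModelSchema::load_from_str(content),
        _ => None, // other kinds would dispatch to their own loaders
    }
}
```

Keeping the path-based `load()` as a wrapper means existing callers continue to work, while the directory walker can call `load_from_str()` with the buffer it has already read.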

Benchmark Results

| Benchmark | Change |
| --- | --- |
| project_load | -13.6% |
| dag_topological_sort | -10.1% |
| propagate_schemas_small | -5.2% |
| parse_complex_join | -5.7% |
| qualify_table_references | -4.5% |
| dag_build | -5.0% |

All improvements confirmed statistically significant by criterion (p < 0.05).

Test plan

  • make test — all 1,092 tests pass
  • make bench — criterion benchmarks confirm improvements
  • make ci-e2e — end-to-end test harness
  • make profile-memory — verify heap reduction via dhat

🤖 Generated with Claude Code

Profiling showed 53% of heap allocations (6.7 MB / 12.7 MB) came from
three redundant parse operations. This commit eliminates all three:

1. YAML double-parse (31.8% of heap): process_node_dir() already reads
   YAML to probe the kind field, then discards it. Loaders re-read and
   re-parse the same file. Fix: add load_from_str() variants and pass
   the already-read content through to loaders.

2. FeatherFlowProvider rebuilt per model (11.6%): propagate_schemas()
   constructed a new provider for every model in topo order, rebuilding
   Arrow schema maps from scratch each time (O(N²)). Fix: build once
   before the loop, incrementally insert_schema() after each model.

3. SQL double-parse in qualify (9.1%): qualify_table_references()
   re-parses SQL that was already parsed in compile_model_phase1().
   Fix: add qualify_statements() that operates on the existing AST,
   store parsed statements in CompileOutput.
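
The complexity change in item 2 can be sketched as follows. This is an illustration under assumptions, not the project's code: `Provider` is a hypothetical stand-in for FeatherFlowProvider, `Schema` is a plain column list rather than an Arrow schema, and the `conversions` counter exists only to make the cost visible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow schema: column names only.
pub type Schema = Vec<String>;

/// Hypothetical stand-in for FeatherFlowProvider.
#[derive(Default)]
pub struct Provider {
    tables: HashMap<String, Schema>,
    /// Counts schema conversions so the two strategies can be compared.
    pub conversions: usize,
}

impl Provider {
    /// Register one schema; each call models one schema conversion.
    pub fn insert_schema(&mut self, name: &str, schema: &Schema) {
        self.conversions += 1;
        self.tables.insert(name.to_string(), schema.clone());
    }

    pub fn schema(&self, name: &str) -> Option<&Schema> {
        self.tables.get(name)
    }
}

/// Old shape: rebuild the provider for every model in topo order.
/// Total conversions are 0 + 1 + ... + (N - 1), i.e. O(N^2).
pub fn propagate_rebuilding(models: &[(String, Schema)]) -> usize {
    let mut known: Vec<&(String, Schema)> = Vec::new();
    let mut total = 0;
    for model in models {
        let mut provider = Provider::default();
        for m in &known {
            provider.insert_schema(&m.0, &m.1);
        }
        total += provider.conversions;
        known.push(model);
    }
    total
}

/// New shape: build once before the loop, insert after each model.
/// Total conversions are N, i.e. O(N).
pub fn propagate_incremental(models: &[(String, Schema)]) -> usize {
    let mut provider = Provider::default();
    for (name, schema) in models {
        // Schema inference for this model would consult `provider` here.
        provider.insert_schema(name, schema);
    }
    provider.conversions
}
```

For four models, the rebuilding variant performs 0 + 1 + 2 + 3 = 6 conversions against 4 for the incremental one, and the gap grows quadratically with the number of models.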

Benchmarks: project_load -13.6%, propagate_schemas -5.2%,
qualify_table_references -4.5%, dag_topological_sort -10.1%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@datastx datastx merged commit c3f02ce into main Feb 23, 2026
9 checks passed